Solving  Kinetic  Equations  on  GPU’s 

Aldo  Frezzotti,  Gian  Pietro  Ghiroldi  and  Livio  Gibelli 


ORGANIZATION 


Contents 

1  Introduction  3 

2  Mathematical  Formulation  5 

3  Outline  of  the  numerical  method  7 

4  GPU  Implementation  9 

4.1  GPU  and  CUDA™  overview .  9 

4.2  Implementation  details .  10 

5  Test  case:  driven  cavity  flow  13 

5.1  Formulation  of  the  problem  .  13 

5.2  Results  and  discussion .  13 

6  Conclusions  21 

7  Acknowledgments  23 

8  Appendix:  CUDA  pseudo-codes  27 


*Dipartimento  di  Matematica  del  Politecnico  di  Milano 
Piazza  Leonardo  da  Vinci  32,  20133  Milano,  Italy 


RTO-EN-AVT-194 


7-1 


Report  Documentation  Page 

Form  Approved 

OMB  No.  0704-0188 

Public  reporting  burden  for  the  collection  of  information  is  estimated  to  average  1  hour  per  response,  including  the  time  for  reviewing  instructions,  searching  existing  data  sources,  gathering  and 
maintaining  the  data  needed,  and  completing  and  reviewing  the  collection  of  information.  Send  comments  regarding  this  burden  estimate  or  any  other  aspect  of  this  collection  of  information, 
including  suggestions  for  reducing  this  burden,  to  Washington  Headquarters  Services,  Directorate  for  Information  Operations  and  Reports,  1215  Jefferson  Davis  Highway,  Suite  1204,  Arlington 

VA  22202-4302.  Respondents  should  be  aware  that  notwithstanding  any  other  provision  of  law,  no  person  shall  be  subject  to  a  penalty  for  failing  to  comply  with  a  collection  of  information  if  it 
does  not  display  a  currently  valid  OMB  control  number. 

1.  REPORT  DATE 

JAN  2011 

2.  REPORT  TYPE 

N/A 

3.  DATES  COVERED 

4.  TITLE  AND  SUBTITLE 

Solving  Kinetic  Equations  on  GPU’s 

5a.  CONTRACT  NUMBER 

5b.  GRANT  NUMBER 

5c.  PROGRAM  ELEMENT  NUMBER 

6.  AUTHOR(S) 

5d.  PROJECT  NUMBER 

5e.  TASK  NUMBER 

5f.  WORK  UNIT  NUMBER 

7.  PERFORMING  ORGANIZATION  NAME(S)  AND  ADDRESS(ES) 

Dipartimento  di  Matematica  del  Politecnico  di  Milano  Piazza  Leonardo 
da  Vinci  32,  20133  Milano,  Italy 

8.  PERFORMING  ORGANIZATION 

REPORT  NUMBER 

9.  SPONSORING/MONITORING  AGENCY  NAME(S)  AND  ADDRESS(ES) 

10.  SPONSOR/MONITOR'S  ACRONYM(S) 

11.  SPONSOR/MONITOR'S  REPORT 
NUMBER(S) 

12.  DISTRIBUTION/AVAILABILITY  STATEMENT 

Approved  for  public  release,  distribution  unlimited 

13.  SUPPLEMENTARY  NOTES 

See  also  ADA579248.  Models  and  Computational  Methods  for  Rarefied  Flows  (Modeles  et  methodes  de 
calcul  des  coulements  de  gaz  rarefies).  RTO-EN-AVT-194 

14.  ABSTRACT 

15.  SUBJECT  TERMS 

16.  SECURITY  CLASSIFICATION  OF: 

17.  LIMITATION  OF 
ABSTRACT 

SAR 

18.  NUMBER 
OF  PAGES 

30 

19a.  NAME  OF 
RESPONSIBLE  PERSON 

a.  REPORT 

unclassified 

b.  ABSTRACT 

unclassified 

c.  THIS  PAGE 

unclassified 

Standard  Form  298  (Rev.  8-98) 

Prescribed  by  ANSI  Std  Z39-18 


1  INTRODUCTION 


1  Introduction 

Non-equilibrium  gas  flows  are  met  in  several  different  physical  situations  ranging  from 
the  re-entry  of  spacecraft  in  upper  planetary  atmospheres  to  fluid-structure  interaction  in 
small-scale  devices  Cercignani  (1988);  Bird  (1994).  The  correct  description  of  nonequi¬ 
librium  effects  requires  replacing  the  traditional  hydrodynamic  equations  with  the  Boltz¬ 
mann  equation  which,  in  the  absence  of  assigned  external  force  fields,  reads 

§£+< »-V,/  =  C(/,/)  (1) 

In  Eq.  (1),  the  distribution  function  f(r,v\t)  is  the  atomic  number  density  at  the  single 
atom  phase  space  point  (r,  v)  at  time  t.  The  symbols  r  and  v  denote  atom  position  and 
velocity,  respectively.  The  left  hand  side  of  Eq.  (1)  represents  the  rate  of  change  of  /  due 
to  the  indipendent  motion  of  gas  atoms.  Effects  of  collisions  are  accounted  for  by  the 
source  term  C(f,  f)  which  is  a  non-linear  functional  of  /  whose  precise  structure  depends 
on  the  assumed  atomic  interaction  forces.  Obtaining  numerical  solutions  of  Eq.  (1)  for 
realistic  flow  conditions  is  a  challenging  task  because  it  has  the  form  of  a  non-linear 
integro-differential  equation  in  which  the  unknown  function,  /,  depends  on  seven  variables. 
Numerical  methods  used  to  solve  Eq.  (1)  can  be  roughly  divided  into  three  groups: 

(a)  Particle  methods 

(b)  Semi-regular  methods 

(c)  Regular  methods 

Methods  in  group  (a)  originate  from  the  Direct  Simulation  Monte  Carlo  (DSMC)  scheme 
proposed  by  G.A.  Bird  Bird  (1994).  They  are  by  far  the  most  popular  and  widely  used 
simulation  methods  in  rarefied  gas  dynamics.  The  distribution  function  is  represented  by 
a  number  of  mathematical  particles  which  move  in  the  computational  domain  and  collide 
according  to  stochastic  rules  derived  from  Boltzmann  equation.  Macroscopic  flow  prop¬ 
erties  are  usually  obtained  by  time  averaging  particle  properties.  If  the  averaging  time  is 
long  enough,  then  accurate  flow  simulations  can  be  obtained  by  a  relatively  small  number 
of  particles.  The  method  can  be  easily  extended  to  deal  with  mixtures  of  chemically  react¬ 
ing  polyatomic  species  Bird  (1994)  and  to  dense  fluids  Frezzotti  et  al.  (2005).  Although 
DSMC  (in  its  traditional  implementation)  is  to  be  recommended  in  simulating  most  of 
rarefied  gas  flows,  it  is  not  well  suited  to  the  simulation  of  low  Mach  number  or  unsteady 
flows.  Attempts  have  been  made  to  extend  DSMC  in  order  to  improve  its  capability 
to  capture  the  small  deviations  from  the  equilibrium  condition  met  in  low  Mach  number 
flows  Homolle  and  Hadjiconstantinou  (2007);  Wagner  (2008).  However,  in  simulating  high 
frequency  unsteady  flows,  typical  of  microfluidics  application  to  MEMS,  the  possibility  of 
time  averaging  is  lost  or  reduced.  Acceptable  accuracy  can  then  be  achieved  by  increasing 
the  number  of  simulation  particles  or  superposing  several  flow  snapshots  obtained  from 
statistically  independent  simulations  of  the  same  flow;  in  both  cases  the  computing  effort 
is  considerably  increased. 

Methods  in  groups  (b)  and  (c)  adopt  similar  strategies  in  discretizing  the  distribution 
function  on  a  regular  grid  in  the  phase  space  and  in  using  finite  difference  schemes  to 
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approximate  the  streaming  term.  However,  they  differ  in  the  way  the  collision  integral  is 
evaluated.  In  semi-regular  methods  C(f,f)  is  computed  by  Monte  Carlo  or  quasi  Monte 
Carlo  quadrature  methods  Frezzotti  (1991);  F.  (2005)  whereas  deterministic  integration 
schemes  are  used  in  regular  methods  Aristov  (2001).  Whatever  method  is  chosen  to  com¬ 
pute  the  collision  term,  the  adoption  of  a  grid  in  the  phase  space  considerably  limits 
the  applicability  of  methods  (b)  and  (c)  to  problems  where  particular  symmetries  reduce 
the  number  of  spatial  and  velocity  variables.  As  a  matter  of  fact,  a  spatially  three- 
dimensional  problem  would  require  a  memory  demanding  six-dimensional  phase  space 
grid.  Extensions  to  polyatomic  gases  are  possible  Frezzotti  (2007)  but  the  necessity  to 
store  additional  variables  associated  with  internal  degrees  of  freedom  further  limits  the 
applications  to  multi-dimensional  flows.  Therefore,  until  now  the  direct  solution  of  the 
Boltzmann  equation  by  semi-regular  or  regular  methods  has  not  been  considered  a  viable 
alternative  to  DSMC  for  simulating  realistic  flows,  not  even  for  low  speed  and/or  unsteady 
flows.  The  availability  of  low  cost  Graphics  Processing  Units  (GPUs)  has  changed  the  sit¬ 
uation.  Although  GPUs  have  been  originally  developed  for  graphics  rendering,  they  have 
became  general  purpose  desktop  supercomputers  capable  of  delivering  teraflops  peak  per¬ 
formance  at  the  price  of  conventional  workstations.  Mapping  efficiently  an  algorithm  on 
the  SIMD-like  architecture  of  the  GPUs,  however,  is  a  difficult  task  which  often  requires 
the  algorithm  to  be  revised  or  even  redesigned  to  both  balance  the  hardware  structure 
benefits  and  meet  the  implementation  requirements.  For  instance,  preliminary  tests,  per¬ 
formed  within  the  framework  of  the  research  work  described  here,  have  shown  that  the 
standard  form  of  DSMC  is  not  efficiently  ported  on  GPU’s  because  of  their  SIMD  archi¬ 
tecture.  On  the  other  hand,  we  have  shown  in  Ref.  Frezzotti  et  al.  (2010),  that  a  regular 
method  of  solution  of  the  BGKW  kinetic  model  equation  is  ideally  suited  for  GPUs.  The 
main  aim  of  the  present  paper  is  to  translate  efficiently  a  semi-regular  method  of  solu¬ 
tion  of  the  full  non-linear  Boltzmann  equation  into  a  parallel  code  to  be  executed  on  a 
GPU.  The  efficiency  of  the  algorithm  is  assessed  by  solving  the  classical  two-dimensional 
driven  cavity  flow.  It  is  shown  that  it  is  possible  to  cut  down  the  computing  time  of  the 
sequential  codes  of  two  order  of  magnitudes.  This  paper  is  organized  as  follows.  Sections 
2  and  3  are  devoted  to  a  concise  description  of  the  mathematical  model  and  the  adopted 
numerical  method.  In  Section  4  the  key  aspects  of  the  GPU  hardware  architecture  and 
CUDA™  programming  model  are  briefly  described  and  implementation  details  are  pro¬ 
vided.  Sections  5  is  devoted  to  the  description  of  the  test  problem  and  the  discussion  of 
the  results.  Concluding  remarks  are  presented  in  Section  6. 
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2  Mathematical  Formulation 


The  hard-sphere  model  is  a  good  approximation  for  simple  fluids,  that  is  fluids  whose 
properties  are  largely  determined  by  harshly  repulsive  short  range  forces.  The  hard-sphere 
Boltzmann  collision  integral  reads 

2  p 

C(f,f)  =  y  /  (/*/r  -  // 1)  |fe  ■  vAdv^k  (2) 

In  Eq.  (2),  cr  is  the  hard  sphere  diameter,  vr  =  v  —  V\  is  the  relative  velocity  between 
two  colliding  atoms  and  f*  =  f(r,  v*\t),  /*  =  f{r,v\\t),  f\  =  f(r,Vi\t).  Here  and  in  the 
remainder  of  the  paper,  integration  extends  over  the  whole  velocity  space.  Similarly,  the 
solid  angle  integration  is  over  the  surface  of  the  unit  sphere,  whose  points  are  associated 
with  the  unit  vector  k.  The  pre-collisional  velocities,  (v*,vi)>  are  obtained  from  the 
post-collision  velocities,  (v,  iq),  and  the  unit  vector  on  the  sphere,  k ,  by  the  relationships 

v*  =  v  +  ^ vr  ■  k^j  k  (3) 

v\  =  tq  —  (vr  ■  k^j  k  (4) 

In  view  of  the  applications  to  the  study  of  low  Mach  flows,  Refs.  Homolle  and  Hadjicon- 
stantinou  (2007);  Baker  and  Hadjiconstantinou  (2008)  will  be  followed  to  rewrite  Eqs.  (1) 
and  (2)  in  terms  of  the  deviational  part  of  the  distribution  function,  h(r,v\t),  defined  as 

/  =  <Lo  (1  +  eh)  (5) 


where  e  is  a  parameter  that  measures  the  deviation  from  equilibrium  conditions  and 
$0(r,u|t)  is  the  Maxwellian  at  equilibrium  with  uniform  and  constant  density  no  and 
temperature  T0,  i.e., 


<Lo  — 


n0 

- ^  exp 

(2ir  RTaf 


(6) 


The  physical  rationale  behind  this  formulation  is  a  proper  rescaling  of  the  (small)  devia¬ 
tion  from  equilibrium  to  reduce  the  variance  in  the  Monte  Carlo  evaluation  of  the  collision 
integral  and  thus  to  capture  arbitrarily  small  deviations  from  equilibrium  with  a  compu¬ 
tational  cost  which  is  independent  of  the  magnitude  of  the  deviation.  By  substituting 
Eq.  (5)  into  Eq.  (1),  we  obtain 


-^  +  v-Vrh=Q{h,h)  (7) 

where  the  collision  integral  takes  the  form 

rf2  f  - 

Q(h,  h)  =  —  /  $o  $o,i  [h*  +  h*1-h-  h1  +  e  (h*h*  -  hht)]  | k  •  vr\dVld2k  (8) 

Eq.  (8)  has  been  obtained  by  using  the  property  $g  $qX  =  $0  $oi  Cercignani  (1988). 

For  later  reference,  we  here  report  the  expressions  of  dimensionless  perturbed  density, 
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velocity,  temperature  and  stress  tensor 


P 

u 

9 

nP 


=  Uohdv 

no  e  J 


V  1 


1 


\/2RT0  €  1  T  6  p 

T-T0 1  1  (1 


Tn  hv  dv 


Tn 


1  +  e p  \3 


-  /  T0  h  v  dv-  p)  -  -u 


Pij  Po^ij  1  /"  i  ;  j  2 

-  -  =  /  <t>o  n  ViVj  dv  —  eUiUj  —  e  p  Ui  Uj 


P  o 


(9) 

(10) 

(11) 

(12) 


where  p0  =  noRT0.  At  the  boundaries,  Maxwell’s  completely  diffuse  boundary  condition 
is  assumed.  Accordingly,  the  distribution  function  of  atoms  emerging  from  walls  is  given 
by  the  following  expression 


T0  +  e  T0  h  =  nw  Tu,  (v  -  Vw)  ■  n  >  0  (13) 

In  Eq.  (13),  n  is  the  inward  normal  and  Tm  is  the  normalized  wall  Maxwellian  distribution 
function 


<Mr’,'y) 


(v-Vw)2' 

2RTW 


(14) 


where  Vw  the  wall  velocity  and  Tw  the  wall  temperature.  The  wall  density  nw  is  deter¬ 
mined  by  imposing  zero  net  mass  flux  at  any  boundary  point 


n, 


(15) 


where  c  =  v  —  Vw.  It  is  worth  noticing  that  when  the  perturbation  is  sufficiently  small, 
i.e.,  e  — >  0,  Eq.  (7)  reduces  to  the  linearized  Boltzmann  equation  and  Eqs.  (9)-(12)  to 
the  linearized  expression  of  the  macroscopic  quantities.  The  formulation  in  terms  of  the 
deviational  part  of  the  distribution  function,  however,  is  not  restricted  to  a  vanishing 
perturbation  but  it  is  valid  in  the  non-linear  case  as  well. 
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3  Outline  of  the  numerical  method 


The  method  of  solution  adopted  to  solve  Eq.  (7)  is  a  semi-regular  method  in  which  a 
finite  difference  discretization  are  used  to  evaluate  the  free  streaming  term  on  the  left 
hand  side  while  the  collision  integral  on  the  right  hand  side  is  computed  by  a  Monte 
Carlo  technique.  The  three-dimensional  physical  space  is  divided  into  Nr  =  Nx  x  Nyx  Nz 
parallelepipedal  cells.  Likewise,  the  three-dimensional  velocity  space  is  replaced  by  a 
parallelepipedal  box  divided  into  Nv  =  NVx  x  NVy  x  NVz  cells.  The  size  and  position  of  the 
“velocity  box”  in  the  velocity  space  have  to  be  properly  chosen,  in  order  to  contain  the 
significant  part  of  h  at  any  spatial  position.  The  distribution  function  is  assumed  to  be 
constant  within  each  cell  of  the  phase  space.  Hence,  h  is  represented  by  the  array  hij  (t)  = 
h(x(ix),y(iy),z(iz),vx(jx),vv(jy),vz(jz)\t);  x{ix),y{iy),z{iz)  and  vx(jx),vy(jy),vz(jz)  are 
the  values  of  the  spatial  coordinates  and  velocity  components  in  the  center  of  the  phase 
space  cell  corresponding  to  the  indexes  i  =  (ix,iy,iz)  and  j  =  (jx,jy,jz)- 
The  algorithm  that  advances  hr-j  =  hij(tn )  to  h 7+1  =  hij(tn  +  At)  is  constructed  by 
time-splitting  the  evolution  operator  into  a  free  streaming  step,  in  which  the  right  hand 
side  of  Eq.  (7)  is  neglected,  and  a  purely  collisional  step,  in  which  spatial  motion  is  frozen 
and  only  the  effect  of  the  collision  operator  is  taken  into  account.  More  precisely,  the 
distribution  function  h^j  is  advanced  to  htC1  by  computing  an  intermediate  value,  h™ d1, 
from  the  free  streaming  equation 

dh 

—  +  v  ■  Wrh  =  0  (16) 

When  solving  Eq.  (16),  boundary  conditions  have  to  be  taken  into  account.  Eq.  (16)  is 
discretized  by  a  simple  first  order  explicit  upwind  conservative  scheme.  For  later  reference, 
we  here  report  the  difference  scheme  in  the  two  dimensional  case  with  vx  >  0  and  vy  >  0 


K,%J  =  (1  -  <X  -  Cu ^  +  <X  hi_ hiylbj  +  Cut  /c„-w 


(17) 


In  Eq.  (17),  Ciia,  =  vx(jx)At/ Ax  and  Cu^  =  vy(jy)At/ Ay  are  the  Courant  numbers  in 
the  x  and  y  directions,  respectively. 

After  completing  the  free  streaming  step,  h7Al  is  obtained  by  solving  the  homogeneous 
relaxation  equation 

dh 

Yf  =  Q{k,  h)  (18) 

where  Q(h,h)  is  given  by  Eq.  (8).  In  order  to  be  solved,  Eq.  (18)  is  first  integrated  over 
the  cell  of  the  velocity  space  Cj 


=  j  Q(h ,  h)dv  (19) 

JCj 

where  N^j  represents  the  deviation  of  the  number  of  particles  with  position  rt  in  the 
velocity  cell  centered  around  the  velocity  node  j  with  respect  to  its  mean  value  at  equilib¬ 
rium,  i.e.,  Nij  ic  AVj  ‘hoy  hij  with  AVj  the  volume  of  the  velocity  cell  Cj.  The  integral 
in  Eq.  (19)  is  then  transformed  into  an  integral  extended  to  the  whole  velocity  domain  V 


d^jjl  =  j  Xj  Q(h,  h)  dv 


(20) 
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where  Xj  is  the  characteristic  function  of  the  cell  Cj 

1  v  e 
0  v  £  Cj 


(21) 


Making  use  of  some  fundamental  properties  of  the  collision  integral  Cercignani  (1988), 
Eq.  (20)  can  be  written  in  the  following  form 


(IN-  ■  d2  f  f1  f2rr 

=  -  J  dv  dv,  $0(«)  *0M  j  i  dkz  j I  # 

\Xj(v*)  +  Xj(v I)  “  Xj(v)  ~  Xj(v i)]  [h(v)  +  h(v,)  +  eh{v)h{v,)]  | k  ■  vr\  (22) 

The  eight-fold  integral  in  Eq.  (22)  is  calculated  by  a  Monte  Carlo  quadrature  method, 
since  a  regular  quadrature  formula  would  be  too  demanding  in  term  of  computing  time. 
The  advantage  of  writing  the  rate  of  change  of  Njj  in  the  above  form  is  that  the  gaussian 
distribution  function  $0  may  be  considered  a  probability  density  function  from  which 
the  velocity  points  are  drawn  to  estimate  the  collision  integral  with  lower  variance.  The 
Monte  Carlo  estimate  of  the  integral  on  the  right  hand  side  of  Eq.  (22)  gives 


dNjj 

dt 


nld27r 


Nt 

£ 

;=i 


[Xj(v*)  +  Xj(v*u)  -  Xj(vi)  -  Xj(vu)] 


[h{vi)  +  h(vu)  +eh(vi)  h(vn)}  \k  ■  vr\  (23) 

where  Nt  is  the  number  of  velocity  samples  Frezzotti  (1991).  It  is  worth  noticing  that 
the  same  set  of  collisions  can  be  used  to  evaluate  the  collision  integral  at  different  space 
locations.  Once  the  collision  integral  have  been  evaluated,  the  solution  is  advanced  from 
the  n-th  time  level  to  the  next  according  to  the  explicit  scheme 


n+1 


hn+L  =  hn  T 


+  Q'-  j  At 


where 


Qn  ■  =  - 

^  A  V,-  <!>,■ 


dNij 


dt 


(24) 


(25) 


Although  memory  demanding,  the  method  outlined  above  produces  accurate  approxima¬ 
tions  of  the  solution  which  do  not  require  time  averaging  to  provide  smooth  macroscopic 
fields.  A  drawback  of  the  technique  is  that,  due  to  the  discretization  in  the  velocity  space, 
momentum  and  energy  are  not  exactly  conserved.  The  numerical  error  is  usually  small 
but  tends  to  accumulate  during  the  time  evolution  of  the  distribution  function.  The  cor¬ 
rection  procedure  proposed  in  Ref.  Aristov  and  G.  (1980)  has  been  adopted  to  overcome 
this  difficulty.  At  each  time  step  the  full  distribution  function  is  corrected  in  the  following 
way 

$op  (1  +  e  hi J1)  =  (1  +  e  h7Al)  [1  +  A  +  B-V  +  Cv2]  (26) 

where  the  constants  A,  B  and  C  are  determined  from  the  conditions 


<h(n)  hn+l(v)  dv=  ip(v)  $(n)  hn+1(v)  dv 


(27) 


being  i^{v)  =  l,v,v2.  The  correction  procedure  given  by  Eq.  (26)  involves  the  full 
distribution  function  and  not  its  deviational  part  in  order  the  linear  system  (27)  to  be 
well  conditioned. 
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4  GPU  Implementation 

4.1  GPU  and  CUDA™  overview 

NVIDIA®GPU  is  built  around  a  fully  programmable  processors  array  organized  into  a 
number  of  multiprocessors  with  a  SIMD-likc  architecture  NVIDIA  Corporation  (2008),  i.e. 
at  any  given  clock  cycle,  each  core  of  the  multiprocessor  executes  the  same  instruction  but 
operates  on  different  data.  CUDA™  is  the  high  level  programming  language  specifically 
created  for  developing  applications  on  this  platform. 

A  CUDA™  program  is  organized  into  a  serial  program  which  runs  on  the  host  CPU  and 
one  or  more  kernels  which  define  the  computation  to  be  performed  in  parallel  by  a  mas¬ 
sive  number  of  threads.  Threads  are  organized  into  a  three-level  hierarchy.  At  the  highest 
level,  all  threads  form  a  grid;  they  all  execute  the  same  kernel  function.  Each  grid  consists 
of  many  different  blocks  which  contain  the  same  number  of  threads.  A  single  multipro¬ 
cessor  can  manage  a  number  of  blocks  concurrently  up  to  the  resource  limits.  Blocks 
are  independent,  meaning  that  a  kernel  must  execute  correctly  no  matter  the  order  in 
which  blocks  are  run.  A  multiprocessor  executes  a  group  of  threads  beloging  to  the  active 
block,  called  warp.  All  threads  of  a  warp  execute  the  same  instruction  but  operate  on 
different  data.  If  a  kernel  contains  a  branch  and  threads  of  the  same  warp  follow  different 
paths,  then  the  different  paths  are  executed  sequentially  (warp  divergence)  and  the  total 
run  time  is  the  sum  of  all  the  branches.  Divergence  and  re-convergence  are  managed  in 
hardware  but  may  have  a  serious  impact  on  performances.  When  the  instruction  has 
been  executed,  the  multiprocessor  moves  to  another  warp.  In  this  manner  the  execution 
of  threads  is  interleaved  rather  than  simultaneous. 

Each  multiprocessor  has  a  number  of  registers  which  are  dynamically  partitioned  among 
the  threads  running  on  it.  Registers  are  memory  spaces  that  are  readable  and  writable 
only  by  the  thread  to  which  they  are  assigned.  Threads  of  a  single  block  are  allowed  to 
synchronize  with  each  other  and  are  available  to  share  data  through  a  high-speed  shared 
memory.  Threads  from  different  blocks  in  the  same  grid  may  coordinate  only  via  opera¬ 
tions  in  a  slower  global  memory  space  which  is  readable  and  writable  by  all  threads  in  a 
kernel  as  well  as  by  the  host.  Shared  memory  can  be  accessed  by  threads  within  a  block  as 
quickly  as  accessing  registers.  On  the  contrary,  I/O  operations  involving  global  memory 
are  particularly  expensive,  unless  access  is  coalesced  NVIDIA  Corporation  (2008).  Be¬ 
cause  of  the  interleaved  warp  execution,  memory  access  latency  is  partially  hidden,  i.e., 
threads  which  have  read  their  data  can  be  performing  computations  while  other  warps 
running  on  the  same  multiprocessor  are  waiting  for  their  data  to  come  in  from  global 
memory.  Note,  however,  that  GPU  global  memory  is  still  ten  time  faster  than  the  main 
memory  of  recent  CPUs. 

Code  optimization  is  a  delicate  task.  In  general,  applications  which  require  many  arith¬ 
metic  operations  between  memory  read/write,  and  which  minimize  the  number  of  out-of- 
order  memory  access,  tend  to  perform  better.  Number  of  blocks  and  number  of  threads  per 
block  have  to  be  chosen  carefully.  There  should  be  at  least  as  many  blocks  as  multiproces¬ 
sors  in  the  device.  Running  only  one  block  per  multiprocessor  can  force  the  multiprocessor 
to  idle  during  thread  synchronization  and  device  memory  reads.  By  increasing  the  num¬ 
ber  of  blocks,  on  the  other  hand,  the  amount  of  available  shared  memory  for  each  block 
diminishes.  Allocating  more  threads  per  block  is  better  for  efficient  time  slicing,  but  the 
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more  threads  per  block,  the  fewer  registers  are  available  per  thread. 

The  computations  that  are  shown  below,  have  been  performed  on  a  commercially  avail¬ 
able  GPU  GeForce  GTX  260  produced  by  NVIDIA®  using  CUDA™  version  2.0.  The 
GTX  260  GPU  model  consists  of  24  streaming  multiprocessors  with  8  streaming  proces¬ 
sors  (SP)  each  for  a  total  of  192  units.  Each  SP  is  clocked  at  1.242  GHz  and  performs  up 
to  3  floating  point  operation  (FLOP)  per  clock  cycle,  yielding  a  peak  theoretical  perfor¬ 
mance  of  715.4  GFLOPs  (192  x  1.242  x  3).  Each  group  of  SP  shares  one  16  kB  of  fast 
per-block  shared  memory  while  the  GPU  has  896  MB  of  device  memory  with  a  memory 
bandwidth  of  111.9  GB/s.  The  graphic  processing  unit  has  been  hosted  by  a  personal 
computer  equipped  with  4  GB  of  main  memory  and  an  Intel®  Core  Duo  Quad  Q9300 
CPU,  running  at  2.5  GHz.  The  host  machine  has  also  been  used  to  run  the  sequential 
version  of  the  program  to  obtain  the  speed-up  data.  The  host  code  has  been  compiled 
using  the  gcc/g++  compiler  with  optimization  option  “-03”. 

4.2  Implementation  details 

The  code  to  numerically  solve  Eq.  (7)  is  organized  into  a  host  program,  which  deals  with 
all  memory  management  and  other  setup  tasks,  and  three  kernels  running  on  the  GPU 
which  perform  the  streaming  and  the  collision  steps.  In  the  following,  we  report  and 
discuss  the  pseudo-codes  of  each  kernel.  Because  of  their  different  impact  on  the  code 
performance,  we  distinguish  the  slow  global  memory  reads,  4=,  and  writes,  =^,  from  the 
fast  reads,  and  writes,  — »,  from  local  registers  and  shared  memory. 

Algorithm  (1)  reports  the  pseudocode  of  the  two  dimensional  streaming  kernel.  The  one¬ 
dimensional  case  has  been  discussed  in  Ref.  Frezzotti  et  al.  (2010)  whereas  the  extension 
to  three-dimensional  geometries  is  straightforward.  Moreover,  for  clarity  of  presentation, 
the  pseudo-code  of  the  streaming  kernel  refers  to  one  cell  of  the  velocity  space  with  vx  >  0 
and  vy  >  0.  The  other  cases  can  be  handled  analogously.  As  shown  by  Eq.  (17),  for  each 
cell  of  the  velocity  domain,  the  streaming  step  involves  the  distribution  function  evaluated 
at  different  space  locations.  Similarly  to  the  one  dimensional  case,  the  key  performance 
enhancing  strategy  is  to  allow  threads  to  cooperate  in  the  shared  memory.  In  order  to 
fit  into  the  device’s  resources,  blocks  are  composed  by  a  two  dimensional  grid  of  threads 
with  dimension  Bx  x  By  having  each  thread  associated  with  one  cell  of  the  physical  space. 
When  a  block  become  active,  each  thread  loads  one  element  of  the  distribution  function 
from  global  memory,  stores  it  into  shared  memory  (line  5),  updates  its  value  according  to 
Eqs.  (17)  (line  21)  and  then  saves  it  back  to  the  global  memory  (line  22).  This  procedure 
is  repeated  sequentially  (. Nx/Bx  —  1)  x  ( Ny/By  —  1)  times.  To  ensure  non-overlapping 
access,  threads  are  synchronized  at  the  onset  of  writing  to  the  global  memory  (lines  20). 
In  order  to  obtain  a  coalesced  access  to  the  global  memory,  values  of  the  discretized  distri¬ 
bution  function  of  cells  which  are  adjacent  in  the  physical  space  are  stored  in  contiguous 
memory  locations.  However,  not  all  the  threads  in  a  block  can  read  data  in  a  coalescent 
manner.  In  fact,  in  order  to  update  hij  the  values  of  the  distribution  function  of  two 
upwind  neighboring  nodes,  often  referred  to  as  “halo”  nodes  Micikevivius  (2009),  are  re¬ 
quired.  The  halos  on  one  physical  direction  can  be  read  in  with  coalesced  access  (line 
13-19)  while  the  others  have  to  be  read  in  with  non-coalesced  access  (line  6-12).  Threads 
which  update  boundary  points  perform  calculations  which  are  slightly  different  to  account 
for  the  incoming  Maxwellian  flux  from  the  boundaries  of  the  domain  (lines  8  and  15). 
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The  relaxation  step  is  organized  into  two  kernels  whose  pseudo-codes  are  listed  in  Algo¬ 
rithms  (2)  and  (3).  The  first  kernel  computes  the  sequence  of  Nt  collisions  used  in  the 
Monte  Carlo  evaluation  of  the  collision  integral.  The  second  kernel  updates  the  discretized 
distribution  function,  executes  the  correction  procedure  and  computes  the  macroscopic 
quantities  of  interest  as  well. 

Algorithm  (2)  reports  the  pseudo-code  of  the  sampling  kernel.  Here,  there  are  as  many 
threads  as  the  number  of  the  collision  samples,  Nt.  Firstly,  each  thread  generates  the 
pre-collisional  velocities  v  and  V\  by  sampling  the  maxwellian  distribution  function  with 
the  Box-Muller  algorithm  (line  1-2)  and  the  unit  vector  k  by  sampling  the  uniform  distri¬ 
bution  on  the  unit  sphere  (line  3).  Afterwards,  the  post-collisional  velocities  are  evaluated 
(line  4-6)  and  the  index  of  the  velocity  cells  to  which  they  belong  are  calculated  and  stored 
in  the  vectors  /,  I\,  I*  and  Jj"  defined  in  the  global  memory  (lines  7-11).  Finally,  for  each 
velocity  cell,  the  values  to  be  added  and  subtracted  to  these  velocity  cells  are  calculated 
(lines  11-14)  and  stored  (lines  15-18)  in  the  vectors  C,  C\ ,  C*  and  defined  in  the 
global  memory.  It  is  important  to  note  that  in  order  to  maximize  the  performance  all 
the  accesses  to  the  global  memory  are  done  in  a  coalesced  manner  NVIDIA  Corporation 
(2008). 

To  update  the  discretized  distribution  function  in  a  cell  of  the  physical  space  according 
to  Eq.  (24),  no  information  from  nearby  space  cells  is  required.  This  naturally  fits  for 
the  GPU,  where  one  may  define  as  many  threads  as  the  number  of  cells  in  the  physical 
space.  Moreover,  by  having  one  thread  for  each  cell  of  the  physical  space,  potentially 
dangerous  conflicts  between  threads  are  avoided  and  the  accesses  to  the  global  memory 
may  be  coalesced.  Firstly,  each  thread  updates  the  discretized  distribution  function  ac¬ 
cording  to  Eq.  (23),  then  executes  the  correction  procedure  to  enforce  the  conservation 
of  momentum  and  energy,  Eq.  (26),  and  finally  compute  the  macroscopic  quantities  of 
interest.  Algorithm  (3)  shows  the  pseudo-code  of  the  relaxation  kernel.  The  main  part 
of  the  algorithm  is  in  between  the  lines  7  and  21  where  the  collision  integral  is  evaluated 
according  to  the  Monte  Carlo  method,  Eq.  (23).  Lines  from  1  to  6  and  from  22  to  30 
evaluate  the  required  moments  of  the  distribution  function  before  and  after  the  collision 
step,  respectively.  These  moments  are  used  in  Eq.  (27)  to  obtain  the  constants  A,  B  and 
C.  The  last  loop  over  the  velocity  space  (lines  32-39)  corrects  the  distribution  function 
according  to  Eq.  (26)  and  compute  the  macroscopic  quantities  of  interest. 
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5  Test  case:  driven  cavity  flow 

5.1  Formulation  of  the  problem 

The  driven  cavity  flow  is  a  classical  benchmark  problem.  In  spite  of  its  simple  geometry,  in 
fact,  it  contains  most  of  the  features  of  more  complicated  problems  described  by  kinetic 
equations  Varoutis  and  Sharipov  (2008).  In  the  following,  we  restrict  to  the  spatially 
two-dimensional  case.  We  thus  consider  a  monatomic  rarefied  gas  contained  in  a  square 
enclosure  with  lenght  L.  All  the  walls  are  kept  at  uniform  and  constant  temperature  To. 
Initially,  the  gas  is  supposed  to  be  in  equilibrium  with  density  no  and  temperature  T0. 
The  flow  is  driven  by  the  uniform  translation  of  the  lid  of  the  cavity  with  velocity  Vw.  We 
describe  the  dynamics  of  the  gas  by  Eq.  (7)  and  assume  that  atoms  which  strike  the  walls 
are  re-emitted  according  to  the  Maxwell’s  scattering  kernel  with  complete  accomodation, 
Eq.  (13). 

We  choose  the  reference  mean  free  path  Ao  to  define  a  dimensionless  position  as  x/Xo, 
being 

Ao  =  Vo/Po(nRT0/2)1/2 

and  Ho  is  the  viscosity  of  the  hard  sphere  gas  Chapman  and  Cowling  (1990).  Likewise 
the  characteristic  time  is  given  by  Xo/Vq,  where  Vo  —  (2i?T0)1//2 .  The  cavity  flow  problem 
has  been  solved  for  three  values  of  the  rarefaction  parameter  6  =  0.1, 1, 10,  being  6  the 
reciprocal  of  the  Knudsen  number  6  =  L/Xo ■  Since  the  proposed  method  of  solution  is 
particularly  effective  in  capturing  small  deviations  from  equilibrium,  we  set  the  lid  velocity 
to  Vw/V0  =  0.01.  The  computations  described  in  below,  hence,  refer  to  very  low  Mach 
number  driven  cavity  flows.  The  square  cavity  has  been  divided  into  Nr  =  Nx  x  Ny 
uniform  square  cells,  with  Nx  =  Ny.  Likewise  the  velocity  space  has  been  divided  into 
Nv  =  NVx  x  NVy  x  NVz  with  NVx  =  NVy  =  NVz.  Since  the  deviation  form  equilibrium 
is  supposed  to  be  small,  a  cubic  velocity  space  has  been  constructed  by  distributing  the 
velocity  nodes  along  each  velocity  component  in  the  interval  [— 3Vo,3Vo].  In  order  to 
achieve  a  faster  convergence  of  the  solutions  in  the  velocity  space,  the  lcnghts  of  the  cells 
are  uniformly  stretched  with  a  progression  ratio  rv  according  to  the  relations  A va(ja)  = 
rvAva(ja  ~  1),  being  the  smaller  cells  located  at  the  origin  of  the  velocity  space.  More 
precisely,  it  has  been  chosen  Nv  =  8000  and  rv  =  0.973  for  6  =  0.1  and  <5  =  1  whereas 
Nv  =  5832  and  rv  =  0.986  for  6  =  10  The  number  of  collisions  used  in  the  Monte  Carlo 
evaluation  of  the  collision  integral  have  been  varied  with  the  rarefaction  parameter.  In 
particular,  it  has  been  set  Nt  =  1024  for  8  —  0.1,  Nt  —  6144  for  6  —  1,  Nt  —  8192  for 
S  =  10.  Finally,  the  time  step  has  been  determined  by  requiring  that  Cux  =  Cu^  =  0.5. 

5.2  Results  and  discussion 

In  this  section,  we  first  carry  out  a  convergence  analysis  of  the  method  in  the  physical 
space  and  then  we  investigate  the  parallel  performances  of  the  code. 

In  order  to  establish  the  convergence  rate  we  compute  two  global  flowfield  properties, 
namely  the  mean  dimensionless  shear  stress  on  the  moving  wall,  D,  and  the  dimensionless 
flow  rate  of  the  main  vortex,  G.  The  two  quantities  are  defined  as 

D  =  ^Jo  n xy(x,L\t)dx,  G  =  ^J^  \ux(L/2,y\t)\dy  (28) 
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Figure  1:  Absolute  relative  error  on  (a)  drag  coeffi¬ 
cient  and  (b)  mean  flow  rate  for  5  =  0.1  (circles),  <5=1 
(squares)  and  5  =  10  (triangles)  versus  the  size  h/5  of 
the  physical  grid.  Lines  are  the  least-mean  square  fit 
of  the  results.  Nv  =  8000,  rv  =  0.973,  Nt  =  1024  for 
5  =  0.1;  Nv  =  8000,  rv  =  0.973,  Nt  =  6144  for  5  =  1; 
Nv  =  5832,  rv  =  0.986,  Nt  =  8192  for  5  =  10. 


The  absolute  relative  error  in  the  stationary  values  of  D  and  G  are  shown  in  Figs  la  and 
lb,  respectively,  versus  the  spatial  grid  size,  h/5  =  1  /Nr,  and  for  5  =  0.1  (circles),  5  =  1 
(squares)  and  5  =  10  (triangles).  The  exact  values  of  D  and  G,  which  are  referred  to  as  De 
and  Ge ,  have  been  extrapolated  from  the  linear  fit  of  the  results  when  h  — >  0.  The  linear 
behaviour  of  the  absolute  relative  errors  demonstrates  that  the  results  are  in  the  asymp¬ 
totic  range  of  convergence  and  the  method  is  first  order  accurate  Salas  (2006).  The  finest 
physical  grid  size  provides  predictions  which  are  accurate  only  within  few  percent.  More 
precisely,  the  largest  error  in  D  is  of  the  order  of  4%  and  is  attained  at  5  =  10  whereas 
the  one  in  G  is  2%  at  5  =  0.1.  The  error  is  mainly  due  to  the  physical  and  velocity 
discretizations.  As  a  matter  of  fact  the  statistical  error  associated  with  the  finite  sample 
size  used  in  the  Monte  Carlo  evaluation  of  the  collision  integral  does  not  affect  the  results 
significantly.  The  standard  deviation  of  D  and  G  with  respect  to  their  averaged  values  in 
stationary  conditions,  in  fact,  is  negligible  small.  For  instance,  the  standard  deviation  of 
D  at  5  =  1  from  its  long-term  mean  value  is  less  than  0.05%.  The  grid  resolution  in  the 
physical  and  the  velocity  domains  are  the  more  accurate  discretization  compatible  with 
the  GPU  global  memory  constraint,  i.e.,  Nv  =  8000  and  Nr  =  25600  for  5  =  0.1, 1  and 
Nv  =  5832  and  Nr  =  36864  for  5  =  10.  Therefore,  in  order  to  improve  the  accuracy  of  the 
numerical  solutions,  we  have  adopted  a  nonuniform  grid  in  the  physical  space.  The  lenghts 
of  the  physical  cells  are  uniformly  stretched  according  to  the  relations  A Xix  =  rxAxlx^\ 
and  A yiy  =  ryAyiy-i  with  progression  ratios  that  depend  on  the  rarefaction  parameter, 
rx  =  ry  =  0.990  for  5  =  10  and  rx  =  ry  =  0.996  otherwise.  The  smaller  cells  are  located 
close  to  the  upper  corners  of  the  cavity  where  severe  gradients  are  anticipated.  All  the 
results  which  follow  have  been  obtained  with  this  discretization.  Table  1  compares  the 
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Figure  2:  Profiles  of  the  dimensionless  (a)  horizon¬ 
tal  mean  velocity  along  the  vertical  line  crossing  the 
center  of  the  cavity,  and  (b)  vertical  mean  velocity 
component  along  the  horizontal  line  crossing  the  cen¬ 
ter  of  the  main  vortex.  Solid  and  dashed  lines:  nu¬ 
merical  solutions  obtained  by  solving  Eq.  (7)  with  the 
semi-regular  method  for  5  =  10  and  5  =  0.1,  respec¬ 
tively.  Circles  and  squares:  numerical  solutions  re¬ 
ported  in  Ref.  Varoutis  and  Sharipov  (2008)  for  S  —  10 
and  d  =  0.1,  respectively.  Nv  =  8000,  rv  =  0.9994, 
Nt  =  1024,  Nr  =  25600,  rx  =  ry  =  0.996  for  5  =  0.1; 
Nv  =  5832,  rv  =  0.970,  Nt  =  8192,  Nr  =  36864, 
rx  =  ry  =  0.990  for  <5  =  10. 


predictions  of  the  mean  stationary  values  of  D  and  G  with  the  extrapolated  exact  values, 
De  and  Ge,  and  the  results  reported  in  Ref.  Varoutis  and  Sharipov  (2008)  where  the  lin¬ 
earized  BGKW  equation  has  been  solved  with  a  discrete  velocity  method.  The  accuracy 
of  the  numerical  solution  of  the  non-linear  Boltzmann  equation  can  be  estimated  to  be 
within  2%.  The  agreement  with  the  BGKW  results  is  good,  which  is  not  unexpected. 
Since  the  velocity  of  the  lid  is  small,  in  fact,  the  gas  is  in  a  weakly  nonequilibrium  state 
and  the  solution  of  the  non-linear  Boltzmann  equation  approaches  the  solution  of  the 
linearized  BGKW  equation.  Figures  2a  and  2b  show  the  profiles  of  the  dimensionless 
horizontal  component  of  the  velocity,  Vx/Vw,  along  the  vertical  line  crossing  the  center  of 
the  cavity  and  the  dimensionless  vertical  component  of  the  velocity,  Vy/Vw,  along  the  hor¬ 
izontal  line  crossing  the  center  of  the  main  vortex  which  forms  in  the  cavity,  respectively. 
Lines  are  the  solutions  of  the  non-linear  Boltzmann  equation,  whereas  symbols  are  the 
results  reported  in  Ref.  Varoutis  and  Sharipov  (2008).  The  results  refer  to  two  different 
values  of  the  rarefaction  parameters:  5  =  0.1  (dashed  lines  and  squares)  and  5  =  10  (solid 
lines  and  circles).  The  agreement  is  very  good. 

Although  the  convergence  analysis  has  been  performed  by  referring  to  quantities  evalu¬ 
ated  in  stationary  conditions,  it  is  important  to  point  out  that  the  proposed  method  of 
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5 

D 

De 

D* 

G 

Ge 

G* 

0.1 

0.6712 

0.6815 

0.676-0.678 

0.0955 

0.0977 

0.0973-0.0976 

1 

0.6266 

0.6389 

0.625-0.631 

0.1017 

0.1039 

0.104-0.105 

10 

0.4096 

0.4176 

0.412-0.415 

0.1427 

0.1451 

0.145-0.145 

Table  1:  Drag  coefficient,  D,  and  mean  flow  rate,  G,  versus  the  rarefaction  parameter, 
5.  De  and  Ge  represent  the  estimated  exact  values;  D*  and  G*  are  reference  values  from 
Varoutis  and  Sharipov  (2008). 


solution  provides  accurate  results  in  the  unsteady  regime  as  well.  As  an  example,  Figure 
3  shows  the  evolution  of  D  during  the  simulation  for  5  =  10.  It  is  important  to  stress  that 
a  similar  result  would  be  difficult  to  obtain  with  a  particle  method  where  computationally 
expensive  ensemble  averanging  are  needed  to  provide  smooth  macroscopic  fields.  The 
performance  of  the  GPU  implementation  is  compared  against  the  single-threaded  version 
running  on  the  CPU  by  computing  the  speed-up  factor  S  =  TCPV/TGPV,  where  TCPU  and 
TGPV  are  the  times  used  by  the  CPU  and  GPU,  respectively.  Times  are  measured  after 
initial  setup,  and  do  not  include  the  time  required  to  transfer  data  between  the  disjoint 
CPU  and  GPU  memory  spaces. 

We  analyse  separately  the  streaming  and  the  collision  step,  the  latter  comprising  both 
the  sampling  and  the  collision  kernel.  Figure  4  reports  the  obtained  speed-up  data  as  a 
function  of  the  number  of  spatial  grid  points  Nr.  The  curve  refers  to  6  =  1.  The  speed-up 
grows  rapidly  with  Nr  and  than  levels  up  at  about  450  if  Nr  approximately  exceeds  104. 
This  behavior  is  the  result  of  the  parallel  set  up  of  the  collision  step  in  Nr  independent 
threads  one  for  each  cell  of  the  physical  space.  As  discussed  below,  the  collision  step 
absorbs  most  of  the  computational  resources  and  its  execution  strongly  affects  the  overall 
performances.  As  shown  by  the  speed-up  curve  the  GPU  power  is  not  fully  exploited  till 
the  number  of  concurrent  threads  reaches  a  threshold.  Beyond,  the  speed-up  saturates 
and  the  computing  time  approximately  behaves  as  a  linear  function  of  Nr.  This  behaviour 
is  similar  to  the  one  reported  in  Refs.  Elsen  et  al.  (2008);  Anderson  et  al.  (2008).  Figure  5 
shows  the  relative  time  which  is  spent  on  the  streaming  step  (dark  bar)  and  on  the  colli¬ 
sion  step  (light  bar)  as  well  as  the  total  execution  time  (numbers  over  the  bars)  for  5  =  1. 
As  expected,  the  collision  step  is  more  time  consuming  than  the  streaming  step  which 
takes  at  most  36%  of  overall  computing  time. 

A  strongly  simplified  evaluation  of  ideal  performances  of  the  streaming  and  collision  step 
can  be  carry  out  as  follows.  A  single  application  of  the  upwind  scheme  requires  the  exe¬ 
cution  of  11  floating  point  operations  and  2.3  accesses  to  the  global  memory.  The  GPU 
delivers  715.4  GFLOPs  but  the  transfer  rate  to/from  the  main  memory  is  limited  to  111.9 
GB/s.  Since  in  the  case  of  streaming  the  ratio  of  number  of  floating  point  operations  to 
the  number  of  bytes  accessed  is  low  (11  :  9.2),  it  is  reasonable  to  obtain  the  number  of 
floating  point  operation  per  second  from  the  transfer  rate  alone.  Hence,  the  ideal  number 
of  GFLOPs  can  be  obtained  by  assuming  that  11  floating  point  operations  will  be  exe¬ 
cuted  in  the  time  required  to  transfer  9.2  bytes  from  the  main  memory.  Accordingly,  this 
simple  argument  yields  an  ideal  performance  of  the  streaming  step  of  133  GFLOPs.  A 
similar  performance  analysis  can  be  applied  to  the  collision  step  which  encompasses  both 
the  sampling  and  collision  kernel.  In  order  to  update  the  distribution  function  and  com- 
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5  TEST  CASE:  DRIVEN  CAVITY  FLOW 


Figure  3:  Drag  coefficient  over  the  moving  wall,  D, 
versus  dimensionless  time,  t/to-  6=1. 


pute  the  macroscopic  quantities  of  interest,  the  number  of  FLOPs  that  the  GPU  executes 
at  any  given  time  step  and  for  each  cell  in  the  physical  space  is  the  sum  of  two  contribu¬ 
tions.  The  first  is  proportional  to  the  number  of  velocity  cells,  Nv,  and  the  second  one  is 
proportional  to  the  number  of  collisions,  Nt,  used  to  evaluate  the  collision  integral.  The 
resulting  number  of  FLOPs  for  each  time  step  are  of  the  order  of  NxNy(80Nv  +  12 Nt). 
Likewise  the  number  of  bytes  accesses  to  the  global  memory  per  time  step  is  of  the  order 
of  Nx  Ny(m*a  +  64 W)  Arguments  similar  to  those  above  lead  to  estimate  an  ideal  per- 
formace  of  the  collision  step  of  about  174.7  GFLOP/s. 

Timing  the  execution  of  the  separate  kernels  and  counting  the  number  of  associated  float¬ 
ing  point  operations  provides  the  real  performance.  The  results  are  reported  in  Fig.  6 
where  GFLOPs  are  shown  as  a  function  of  the  number  of  grid  points,  Nr.  Solid  line  with 
circles,  dashed  line  with  squares  and  dot-dashed  line  with  triangles  are  the  measured  per¬ 
formances  of  the  streaming  step,  the  collision  step  and  the  overall  code,  respectively.  It 
is  possible  to  note  that  the  performance  of  the  streaming  step  grows  with  Nr  and  quickly 
levels  at  about  30  GFLOPs,  approximately  one  third  of  the  estimated  ideal  performance. 
The  difference  can  be  justified  by  observing  that  the  real  CLIDA  implementation  of  the 
finite  difference  scheme  is  not  free  from  thread  divergence  and  ancillary  tasks  whose  ef¬ 
fects  can  be  evaluated  with  difficulty  Micikevivius  (2009).  The  collision  step  performance 
closely  patterns  the  speed-up  behavior,  that  is  it  rapidly  grows  in  the  range  Nr  <  104  and 
then  levels  up  at  about  140  GFLOP/s,  reasonably  close  to  the  theoretical  estimate.  The 
collision  kernel  performs  better  than  the  streaming  kernel  due  to  its  higher  FLOP  to  mem¬ 
ory  operation  ratio  which,  in  turn,  allows  a  more  efficient  use  of  GPLT  computing  power. 
The  absence  of  thread  divergence  is  also  a  feature  which  positively  affects  performances. 
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5.2  Results  and  discussion 


Figure  4:  Overall  speed-up,  S,  versus  number  of  cells 
in  the  physical  space,  Nr ;  8  =  1,  Nv  —  8000,  Nt  = 
6144,  Nr  =  25600 


Figure  5:  Relative  time  spent  on  the  streaming  step 
(dark  bar)  and  on  collision  step  (light  bar).  The  num¬ 
bers  above  the  bars  refer  to  the  total  execution  time 
expressed  in  seconds.  8  —  1,  Nv  —  8000,  Nt  =  6144, 
Nr  =  25600 
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5.2  Results  and  discussion 


5  TEST  CASE:  DRIVEN  CAVITY  FLOW 


Figure  6:  GFLOP/s  versus  the  number  of  cells  in  the 
physical  space,  Nr.  Solid  line  with  circles:  streaming 
step;  dashed  line  with  squares:  collision  step;  dot  and 
dashed  line  with  triangles:  overall  code.  8  —  1,  Nv  — 
8000,  Nt  =  6144,  Nr  =  25600. 
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6  CONCLUSIONS 


6  Conclusions 

In  this  notes  we  have  described  the  possibility  of  exploiting  the  computational  power  of 
modern  GPUs  to  simulate  nonequilibrium  rarefied  gas  flows.  The  full  nonlinear  Boltzmann 
equation  has  been  solved  by  means  of  a  semi-regular  method  which  combines  a  finite 
difference  discretization  of  the  free-streaming  term  with  a  Monte  Carlo  evaluation  of  the 
collision  integral.  This  method  of  solution  is  ideally  suited  for  the  SIMD-likc  architecture 
provided  by  the  commercially  available  GPUs.  The  two  dimensional  driven  cavity  flow 
has  been  used  as  a  benchmark  problem.  The  results  lead  to  concluding  that  the  porting 
of  the  sequential  code  onto  GPUs  allows  a  reduction  of  the  computing  time  of  two  orders 
of  magnitude,  being  the  observed  speed-up  as  high  as  400.  Although  the  test  problem 
examined  here  has  clearly  shown  that  the  size  of  physical  memory  is  the  main  obstacle 
toward  the  application  to  complex  two  or  three-dimensional  flows,  the  numerical  method 
described  above  can  be  reformulated  as  a  less  memory  particle  scheme  in  many  ways. 
Hence,  the  present  work  and  results  are  a  first  step  toward  the  construction  of  a  more 
flexible  and  efficient  method  for  the  numerical  solution  of  kinetic  equations. 
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9 

10 

11 

12 

13 

14 

15 

16 

17 

18 

19 

20 
21 
22 

23 

24 

25 

26 


Algorithm  1  GPU  pseudo-code  of  the  two-dimensional  streaming  kernel 

Require:  Cu,.  is  Courant  number  along  x  direction 

Require:  Cuy  is  Courant  number  along  y  direction 

Require:  Bx  is  the  number  of  threads  in  x  direction 

Require:  By  is  the  number  of  threads  in  y  direction 

Require:  tx  is  the  index  of  the  thread  in  x  direction 

Require:  ty  is  the  index  of  the  thread  in  y  direction 

Require:  /sh  is  a  matrix  of  dimensions  (Bx  +  1)  x  (By  +  1)  in  the  shared  memory 
1:  for  I  by  =  Ny/By  —  1  to  0  do 
for  Ibx  =  Nx/ Bx  —  lto  0  do 
^  tx  A  tBxIbx 

Xy  i  ty  "|“  Bylfry 

fsh  (tx  +  1,^  +  1) 
if  ty  ==  0  then 
if  iy  —  1  <  0  then 

fSh(tx,ty)  <-  boundary Flux 
else 

fsh(tX,ty )  4= 

end  if 
end  if 

if  tx  ==  0  then 
if  ix  —  1  <  0  then 

fsh(tx,ty)  <—  boundaryFlux 
else 

.cri 

J  'tx  1 J ty  ij 


fi 


'tx  I'ty  'ij 


fshi^xi ty) 

end  if 
end  if 

syncthreads 

fl'g  t  (1  C  Ux  Cu^j/A  [tx  1)  ty  -)-  1)  -|-  Criy  fshifxi  ty  -|-  1)  A  C  U;p  ,/sh  (t  X  A  1,  ty) 

f  fH+1  . 

Jig  =?  JlX,ly,j 

Ibx  t  Ibx  1 

end  for 

Ihii  ^  Iby  f 


lby 

end  for 
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Algorithm  2  GPU  pseudo-code  of  the  sampling  kernel 
Require:  %  is  the  global  index  of  thread 
1:  v  <r- BoxMulller 
2:  v i  ^—BoxMulller 
3:  k  4—  UnitSphere 
4:  Vr  4—  Vi  —  V 
5:  V*  4—  v  +  (vr  ■  k)k 
6:  v\  4—  V\  —  (vr  ■  k)k 
7:  cells(v)  =>  I(i) 

8:  cells(vi)  =>•  I\{i) 

9:  cells(v*)  =>-  I*(i) 

10:  cells(v*)  =>-  I*(i) 

11:  ()j  4—  dt‘Kd2n^{AVj^j)~1\vr  ■  k\ 

12:  (jjl  4—  dtiid2,ril(tS.Vjl<$>jl)-1\vr  ■  k\ 

13:  (jj*  4—  dt'Kd2nl(AVj*^j*)~1\vr  ■  k\ 

14:  gj*  4—  dtnd2nl(AVji$j*)-1\vr  ■  k\ 

15:  gj  =t  C(i ) 

16:  gjl  =>  C'i(i) 

17:  gj*  =>-  C*(i) 

18:  fi'j*  =>■  C*(i) 
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Algorithm  3  GPU  pseudo-code  of  the  collision  kernel 


Require:  i  is  the  global  x  index  of  the  thread  inside  the  grid 
Require:  j  is  the  global  y  index  of  the  thread  inside  the  grid 


1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 

12 

13 

14 

15 

16 

17 

18 

19 

20 
21 
22 

23 

24 

25 

26 

27 

28 

29 

30 

31 

32 

33 

34 

35 

36 

37 

38 

39 


for  all  j  do 

h  4=  h 7+1 
P  t—  p  +  3>o,j  h 

U  <—  U  +  Vi  $0 


j  h 


e  <—  e  +  \vj\2  $0^  h 

end  for 

for  m  —  1  to  Nt  do 
h  4=  hn+' 


j  n+i 

K^m) 


hi 
h* 

h\ 

g  i —  h  h\  -|-  e  h  h\ 

h  <—  h  —  C(m )  g 
hi  <—  hi  —  Ci  ( m )  g 
h*  «-  h*  +  C*(m)  g 
hi  «-  h\  +  cf(m)  g 
h  = 

hi 
h* 


Kjm 

>  KXU 


h  i  => 

end  for 
for  all  j  do 

h  <=  h#1 

Qu  t —  an  +  d'o.j  h 

aV2  t —  0-12  +  Vj  $0,J  h 
°13  t—  Oi3  +  \Vj\2  d’oj  h 

/ /  others  moments  of  the  distribution  function 

end  for 

[A,  B ,  C]=linearSolver(n,  n,  e,  an,  oi2,  ai3, . . .) 

for  all  j  do 

h  4=  h"X 

h  <—  l/e((l  +  eh)(l  +  A  +  B  ■  Vj  +  Cv2-  —  1) 
h 


KT 


/ /  compute  macroscopic  quantities 


end  for 


RTO-EN-AVT-194 


7-29 


8  APPENDIX:  CUDA  PSEUDO-CODES 


7-30 


RTO-EN-AVT-194 


